Boosting Statistical Word Alignment Using Labeled and Unlabeled Data
نویسندگان
چکیده
This paper proposes a semi-supervised boosting approach to improve statistical word alignment with limited labeled data and large amounts of unlabeled data. The proposed approach modifies the supervised boosting algorithm to a semisupervised learning algorithm by incorporating the unlabeled data. In this algorithm, we build a word aligner by using both the labeled data and the unlabeled data. Then we build a pseudo reference set for the unlabeled data, and calculate the error rate of each word aligner using only the labeled data. Based on this semisupervised boosting algorithm, we investigate two boosting methods for word alignment. In addition, we improve the word alignment results by combining the results of the two semi-supervised boosting methods. Experimental results on word alignment indicate that semisupervised boosting achieves relative error reductions of 28.29% and 19.52% as compared with supervised boosting and unsupervised boosting, respectively.
منابع مشابه
Semi-Supervised Block ITG Models for Word Alignment
Labeled training data for the word alignment task, in the form of word-aligned sentence pairs, is hard to come by for many language-pairs. Hence, it is natural to draw upon semi-supervised learning methods (Fraser and Marcu, 2006). We introduce a semisupervised learning method for word alignment using conditional entropy regularization (Grandvalet and Bengio, 2005) on top of a BITG-based discri...
متن کاملMulticlass Semi-supervised Boosting Using Different Distance Metrics
The goal of this thesis project is to build an effective multiclass classifier which can be trained with a small amount of labeled data and a large pool of unlabeled data by applying semi-supervised learning in a boosting framework. Boosting refers to a general method of producing a very accurate classifier by combining rough and moderately inaccurate classifiers. It has attracted a significant...
متن کاملBoosting Statistical Word Alignment
This paper proposes an approach to improve statistical word alignment with the boosting method. Applying boosting to word alignment must solve two problems. The first is how to build the reference set for the training data. We propose an approach to automatically build a pseudo reference set, which can avoid manual annotation of the training set. The second is how to calculate the error rate of...
متن کاملExploiting Unlabeled Data Using Improved Natural Langua
This paper presents an unsupervised method that uses limited amount of labeled data and a large pool of unlabeled data to improve natural language call routing performance. The method uses multiple classifiers to select a subset of the unlabeled data to augment limited labeled data. We evaluated four widely used text classification algorithms; Naive Bayes Classification (NBC), Support Vector ma...
متن کاملImproving BAS committee performance with a semi-supervised approach
Semi-supervised Learning is a machine learning approach that, by making use of both labeled and unlabeled data for training, can significantly improve learning accuracy. Boosting is a machine learning technique that combines several weak classifiers to improve the overall accuracy. At each iteration, the algorithm changes the weights of the examples and builds an additional classifier. A well k...
متن کامل